Automated Information Retrieval in Science and Technology


Tamas E. Doszkocs, Barbara A. Rapp, Harold M. Schoolman 


Summary. The rapid advances in computer and communication technology in the 1970's have enabled large interactive scientific and technical information retrieval systems to be implemented. Major search services today offer on-line access to millions of bibliographic citations and an increasing number of “electronic handbooks.” In addition, development of knowledge bases is well under way. Despite the impressive speed and flexibility of interactive retrieval systems, their impact has been lessened by limited awareness of their existence, uneven quality of retrieval, inadequate linkages among data bases, and reliance on specially trained intermediaries.

In the past 15 years, automated information retrieval systems for science and technology have been developed at a remarkable rate. Today more than 1000 data bases are available for computerized searching. Although more than 2 million searches of these data bases will be made this year, many scientists and technologists are either unaware of them or use them in a superficial manner. This article describes the development of data bases at the National Library of Medicine (NLM) to illustrate the evolution, present capabilities, and potential of automated information retrieval systems.

In 1979, the NLM celebrated the centennial of John Shaw Billings' launching of Index Medicus, derived from his monumental Index Catalogue of the Surgeon General's Office. Billings, with help from Robert Fletcher, indexed 20,000 articles selected from 570 medical journals throughout the world. Today, the NLM indexes more than 20,000 articles a month from approximately 3,000 journals selected from the more than 20,000 received.

In the late 1950's F. B. Rogers, then director of the NLM, recognized that the usefulness of the library's bibliographic publications was being threatened by the ever-increasing volume of medical literature to be processed. He therefore initiated the mechanization that became MEDLARS (Medical Literature Analysis and Retrieval System) to support the library's bibliographic publications. MEDLARS became operational in 1964. Aided by GRACE (Graphic Arts Composing Equipment) (1), the August 1964 Index Medicus was electronically typeset in 18 hours.

The automated bibliographic data base used for publications was also available for machine searching. The retrieval system permitted complex searches to be processed in a batch mode (that is, without direct interaction with the computer). In 1970, at the height of this activity, approximately 18,000 searches were made. The turnaround time to users was 40 to 60 days.

In 1970 and 1971, the NLM conducted the first experiments in on-line access to its bibliographic data base (2). In 1971, MEDLINE (MEDLARS on-line) became operational. Interactive searching from remote terminals in 10 regional libraries and 14 large academic medical libraries was now possible. Today the service supports on-line access by more than 1000 domestic and foreign centers conducting about 1.5 million searches of the NLM's 19 data bases each year (Table 1).

By 1973, the NLM's MEDLINE experience (3) had demonstrated the potential for commercial development of other on-line scientific search systems and services. Rapid advances in computer and communications technology made widespread availability and use of on-line retrieval systems feasible.

The development of telecommunications [...] cost of on-line systems. In the United States, these networks provide links to computers throughout the country. The connection is typically established by a local telephone call to the nearest network access node.

Although most on-line data bases contain bibliographic citations to the published literature, many data bases also provide numeric data that can be used to answer a specific question. “Knowledge” bases, which contain an analysis and synthesis of published information in a given field, are also beginning to emerge.

Bibliographic data bases are roughly analogous to printed indexes such as Index Medicus or to a library's card catalog; data banks may be thought of as automated reference manuals; and knowledge bases are rough equivalents of textbooks or state-of-the-art reviews. Each of these data base types is discussed in greater detail below.


Bibliographic Data Bases 


Currently there are 528 publicly available bibliographic or bibliographic-related data bases (4). They contain more than 70 million citations or records and span many subject areas, including life sciences, chemistry, agriculture, energy, the environment, engineering, electronics, physics, geoscience, astronomy, toxicology, and pharmacology. The majority can be searched on-line.

In the United States, on-line access to a number of data bases is provided by three major commercial vendors: Lockheed Information Systems, Bibliographic Retrieval Services (BRS), and Systems Development Corporation (SDC). There is considerable overlap among these vendors. Of the 121 data bases, 112 are offered by Lockheed, 60 by SDC, and 30 by BRS. The cost of these services ranges from about $8 to $120 for each hour the user is connected to the vendor's computer. An additional charge is made for the use of the telecommunications network. This charge varies from approximately $3 to $6 per hour and is not affected by the distance involved. Access to some data bases is offered only by their producers. The NLM serves as both data base producer and distributor and offers access to 19 different data bases (5).
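As a rough worked example of the charges just described, the sketch below totals the vendor's connect-time charge and the network charge for one search session. The function name `session_cost` and the 15-minute session length are illustrative assumptions; only the hourly rate ranges come from the text.

```python
def session_cost(minutes, vendor_rate_per_hour, network_rate_per_hour):
    """Return the total charge in dollars for one on-line session.

    Both charges are billed per connect hour. The network charge does not
    depend on distance, so it is treated as a flat hourly rate.
    """
    hours = minutes / 60.0
    return hours * (vendor_rate_per_hour + network_rate_per_hour)


# Hypothetical 15-minute session at the low and high ends of the quoted rates.
low = session_cost(15, vendor_rate_per_hour=8, network_rate_per_hour=3)
high = session_cost(15, vendor_rate_per_hour=120, network_rate_per_hour=6)
print(f"${low:.2f} to ${high:.2f}")
```

Under these assumptions the connect-time charge dominates the total, so the choice of data base matters far more to the cost of a search than the network link does.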


Dr. Doszkocs is Chief of the Technical Services Division, Ms. Rapp is a librarian in the Bibliographic Services Division, and Dr. Schoolman is Deputy Director for Research and Education at the National [...]

[...] references to the published literature and are most often used as tools that guide one to a journal report. For the most part, these systems are on-line versions of existing indexing and abstracting services such as Engineering Index, Chemical Abstracts, Index Medicus, Science Citation Index, and Government Reports Announcements Index. Some combine information from several sources and others provide access to a part of only one data base.

Boolean logic is used in searching bibliographic data bases. Several search terms can be combined by using the set operators AND, OR, and NOT so that sophisticated searches can be made. Printed indexes rely on human or machine-aided indexing with carefully controlled vocabularies (thesauri). Although conventional indexing is used in many on-line data bases, some can also be searched by using words or parts of words that actually appear in the text. For instance, while some automated data bases provide only title, author, and source information, many include a short abstract of the article or report. In this case, the text words in the abstract as well as in the title may be searchable.
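The set operators can be illustrated with a minimal sketch in which each search term maps to the set of record numbers indexed under it. The postings and record numbers below are made up for illustration and are not drawn from any real file.

```python
# Hypothetical postings: each search term maps to the set of record
# numbers in which it occurs.
postings = {
    "thromboembolism": {1, 2, 3, 5},
    "cerebrovascular disorders": {3, 4, 8},
    "contraceptives, oral": {2, 3, 4, 6},
    "middle age": {3, 7},
}

# OR widens a search: records indexed under either heading.
set1 = postings["thromboembolism"] | postings["cerebrovascular disorders"]
# AND narrows it: records that are also indexed under the second concept.
set3 = set1 & postings["contraceptives, oral"]
# NOT excludes: drop records indexed under the unwanted heading.
set5 = set3 - postings["middle age"]

print(sorted(set1), sorted(set3), sorted(set5))
```

Each intermediate result is kept as a numbered set, which is what lets a searcher combine earlier results in later statements instead of retyping the terms.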

In MEDLINE, access is possible through authors, index terms from a thesaurus, or words in the title or abstract. A search can be limited to certain publication years, languages, or journal titles. The searcher may also specify sex, age, or research with human or animal subjects. Either specialized or general bibliographies can be prepared. In addition, the current month's additions to the data base may be searched as a separate file. This file is called SDILINE (Selective Dissemination of Information on-line).

Complex MEDLINE searches are ordinarily performed by trained intermediaries. The NLM provides a comprehensive training program for its users, including a computer-aided instruction package. Figure 1 shows a typical MEDLINE interaction between the searcher and the retrieval program.

The Institute for Scientific Information provides on-line access to its Science Citation Index and Social Science Citation Index through the Lockheed Information Systems' DIALOG system. These two data bases, which have the same features as other bibliographic data bases, can also be searched for authors and documents that cite or are cited by the individual journal articles. This capability represents an important enhancement of subject-based retrieval, since it provides rich associative pathways among scientists and scientific research developments.
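Citation searching can be pictured as following a reference list in either direction. The sketch below is only an illustration of that idea, with made-up paper identifiers; it does not describe how any vendor's system is implemented.

```python
from collections import defaultdict

# Hypothetical citation data: each paper maps to the papers it cites.
cites = {
    "A": ["B", "C"],
    "B": ["C"],
    "D": ["B"],
}

# Invert the mapping to answer "which papers cite this one?"
cited_by = defaultdict(list)
for paper, references in cites.items():
    for ref in references:
        cited_by[ref].append(paper)

print(cites["A"])             # documents cited by A
print(sorted(cited_by["B"]))  # documents that cite B
```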

Although bibliographic data bases similar to those mentioned above are by far the most numerous, others are rapidly developing. Particularly useful are those that provide summaries of ongoing research projects and names and addresses of the principal investigators. The NLM, in cooperation with the National Cancer Institute, maintains two such files relating to cancer research: (i) CANCERPROJ, containing approximately 16,000 summaries of ongoing cancer research projects in many countries and (ii) CLINPROT, containing summaries of clinical investigations of new anticancer agents and treatments. Two other such files are the Smithsonian Science Information Exchange data base (available through SDC), which includes information about ongoing government-funded research projects, and the Current Research Information System (available through the Lockheed DIALOG system), which contains data on research in many fields.


Table 1. Data bases of the NLM and their contents.

Data base      Total records   Dates covered
AVLIN                  7,680   Through 1979
BACK66*              501,802   Jan. 1966 to Dec. 1968
BACK69*              668,258   Jan. 1969 to Dec. 1971
BACK72*              669,109   Jan. 1972 to Dec. 1974
BACK75*              642,953   Jan. 1975 to Dec. 1978
BIOETHICS              7,733   Jan. 1963 to Sept. 1979
CANCERLIT            183,433   1976 to 1979
CANCERPROJ            18,641   1965 to 1979
CATLINE              191,053   N/A
CHEMLINE             425,112   N/A
CLINPROT               1,520   N/A
EPILEPSY              25,635   1945 to present
HEALTH               125,608   Jan. 1975 to Aug. 1979
HISTLINE              37,256   N/A
MEDLINE              573,960   Jan. 1977 to Oct. 1979
MESH VOC              14,819   1979
NAME AUTH            105,873   1979
RTECS                 36,851   1978
SDILINE               21,543   October 1979
SERLINE               33,122   1979
TDB                    2,515   N/A
TOXLINE              628,743   1950 to 1979
  CBAC               326,675   1974
  TOXBIB             123,868   1974 to Oct. 1979
  IPA                 35,265   1974 to July 1979
  HEEP                75,876   1974
  PESTAB              16,063   1974 to June 1979
  EMIC                24,290   1960 to June 1979
  ETIC                15,301   1950 to Mar. 1979
  RPROJ                9,567   Nov. 1979
TOXBACK              379,299   1940 to 1975
  CBAC               167,668   1965 to 1973
  TOXBIB             127,104   1968 to 1973
  IPA                 19,188   1970 to 1973
  HEEP                24,680   1971 to 1973
  HAPAB               12,816   1966 to 1973
  HAYES               10,043   1940 to 1966
  TMIC                 4,552   1971 to 1975
  TERA                13,248   1960 to 1974

*MEDLINE back file.

A major problem in performing on-line searches is that of choosing appropriate search terms. The searcher must think of all the possible ways to express a concept in anticipation of the words chosen by the author and indexer. An on-line dictionary file can be very useful in this respect, allowing synonyms for one concept to be pulled together. This is especially important in specialized areas such as chemical nomenclature, in which there are more than 7 million chemical substances and 40 to 50 ways of identifying each compound.

The NLM has two on-line, interactive dictionary files: CHEMLINE (chemical dictionary on-line) and MESH (medical subject headings). CHEMLINE provides a mechanism whereby more than 760,000 chemical names representing nearly 415,000 compounds can be searched and retrieved on-line. This file contains Chemical Abstracts Service (CAS) registry numbers, molecular formulas, preferred chemical index nomenclature, generic and proprietary names derived from the CAS registry nomenclature file, and a locator designation that points to other files in the NLM system containing information on a particular chemical substance. Where applicable, each registry number record in CHEMLINE contains ring information including the number of component rings within a ring system, ring sizes, ring elemental compositions, and component line formulas. The user searches CHEMLINE by entering either a chemical name, generic name, trivial name, commercial name, molecular formula, or even a part of a name. The registry number information or other nomenclature can then be used in performing more thorough searches of the other NLM files (for example, TOXLINE, a file of more than 600,000 journal citations and abstracts pertaining to toxicology and the environment).
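The role of a dictionary file can be sketched as a lookup from any of a compound's names, or a fragment of one, to a single registry number that is then carried into other files. The registry number and names below appear in the sample RTECS record reproduced in this article; the record layout, the `locator` values, and the function name `find_by_fragment` are assumptions made for illustration and do not describe the actual CHEMLINE format.

```python
# A toy dictionary file with one record.
records = [
    {
        "registry_number": "602-87-9",
        "names": ["ACENAPHTHENE, 5-NITRO-", "5-NITROACENAPHTHENE", "5-NAN"],
        "locator": ["RTECS", "TOXLINE"],  # hypothetical locator values
    },
]

# Index every name (upper-cased) to its record for exact-name lookup.
by_name = {name.upper(): rec for rec in records for name in rec["names"]}


def find_by_fragment(fragment):
    """Return records having at least one name that contains the fragment."""
    fragment = fragment.upper()
    return [r for r in records if any(fragment in n.upper() for n in r["names"])]


hit = by_name["5-NAN"]
print(hit["registry_number"], hit["locator"])
print([r["registry_number"] for r in find_by_fragment("nitroacenaph")])
```

Whichever synonym the searcher starts from, the same registry number comes back, and that number is the unambiguous key for the follow-up search.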

The MESH vocabulary file is an on-line thesaurus of medical terminology used in indexing journal citations for MEDLINE. It is thoroughly cross-indexed (for example, GUINEA WORM, see DRACUNCULUS MEDINENSIS; AVOIDANCE LEARNING, see related ESCAPE REACTION; MIREX, see under INSECTICIDES, ORGANOCHLORINE).
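A minimal sketch of how such cross-references might be followed when choosing search terms is given below. The two entries are those quoted in the text; the data layout, the function name `search_terms`, and the reading of "see" as a replacement and "see related" as an addition are assumptions based on common thesaurus practice, not a description of the MESH file itself.

```python
# Cross-references keyed by entry term. Each value holds the reference
# type and the target heading.
cross_refs = {
    "GUINEA WORM": ("see", "DRACUNCULUS MEDINENSIS"),
    "AVOIDANCE LEARNING": ("see related", "ESCAPE REACTION"),
}


def search_terms(term):
    """Return the headings to search for a term, following cross-references.

    A "see" reference replaces the entry term with the preferred heading;
    a "see related" reference keeps the term and adds the related heading.
    """
    term = term.upper()
    kind, target = cross_refs.get(term, (None, None))
    if kind == "see":
        return [target]
    if kind == "see related":
        return [term, target]
    return [term]


print(search_terms("guinea worm"))
print(search_terms("avoidance learning"))
```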
    PROG:
    YOU ARE NOW CONNECTED TO THE MEDLINE FILE.
    SS 1 /C?
    USER:    (Enter medical subject headings to express primary concept.)
    thromboembolism or cerebrovascular disorders
    PROG:
    SS (1) PSTG (1631)
    SS 2 /C?
    USER:    (Enter next major concept.)
    contraceptives, oral
    PROG:
    SS (2) PSTG (760)
    SS 3 /C?
    USER:    (Match retrieved sets of records.)
    1 and 2
    PROG:
    SS (3) PSTG (40)
    SS 4 /C?
    USER:
    middle age
    PROG:
    SS (4) PSTG (54985)
    SS 5 /C?
    USER:    (Exclude the latter item from the retrieved set.)
    3 and not 4
    PROG:
    SS (5) PSTG (28)
    SS 6 /C?
    USER:    (Print the 12th record with full abstract.)
    print 1 include abstract skip 11
    PROG:

    12
    AU  - Hilliard GD
    AU  - Norris HJ
    TI  - Pathologic effects of oral contraceptives.
    AB  - The pathologic effects of oral contraceptives have been described
          in this paper and in other reviews [1, 5, 23, 35, 44, 48, 59, 80,
          103, 114]. Approximately 10 million women currently use oral
          contraceptives in the United States. These drugs are beneficial
          both to the users and for population control. It is their effect
          on the health status of women who take them that must continue to
          have well-organized investigation so that more meaningful
          conclusions concerning their safety will permit continued use. In
          some instances, the pathologic effects of oral contraceptives
          make it necessary that new methods of contraception be found.
          Intensive research in this area is needed and judicious use of
          oral contraceptives must be maintained. A national registry
          should be formed to record and investigate the cases of women who
          die or have adverse reactions while taking these agents. A
          registry might identify associations not previously known to exist
          in patients taking oral contraceptives. It would serve to
          concentrate the data in one area so that more material would be
          available for the study of pathogenetic mechanisms. It would
          heighten patient and physician awareness of the untoward effects
          and increase the responsibilities of the women who take them to
          monitor their own health.
    SO  - Recent Results Cancer Res 1979;66:49-71

    PROG:
    DONE? (YES/NO)
    USER:
    yes
    PROG:
    GOOD-BYE!
    TIME 0:01:58   NLM TIME 17:44:01

Fig. 1. MEDLINE search on the topic “thromboembolism or cerebrovascular disorders due to the use of oral contraceptives by persons who are not middle-aged.”

4 APRIL 1980


Data Banks


Data bases that contain numeric and analytic data derived from the published literature and references to the source of the information are beginning to appear. These systems are comparable to handbooks. [...] bases of this type: RTECS (Registry of Toxic Effects of Chemical Substances) and TDB (Toxicology Data Bank). The RTECS file is the on-line version of a compilation published annually by the National Institute for Occupational Safety and Health [...] data for approximately 36,000 substances known by more than 125,000 different names. The file also contains threshold values, recommended standards in air, aquatic toxicity, some listings of toxicological effects, and the sources from which the data were taken. The file can [...] information on a particular chemical substance, species, route of administration, and effect. Figure 2 shows a sample RTECS record.

The TDB contains chemical, pharmacological, and toxicological information and data from some 80 major textbooks and handbooks. The data were reviewed by scientists knowledgeable in the subject matter. The file contains data on 1122 chemical substances and information on some 1500 additional substances for which data are being collected. Thus approximately 2600 substances are in the TDB file. The TDB contains more than 60 different data elements (for example, synonyms, molecular formulas, and mechanisms of action).

A variant is the recently developed Laboratory Animal Data Bank (LADB). It obtains data directly from participating laboratories rather than from published literature. Numeric data bases such as LADB typically permit interactive statistical analysis of the stored data and have computational and report-generation capabilities [...] minicomputer. The content [...] planners, and researchers in making decisions. Using LADB, a scientist may (i) select and examine physiologic and pathologic baseline data for various groups of animals; (ii) determine the environmental and husbandry conditions for each animal group selected; (iii) statistically analyze the obtained data; and (iv) cause the data to be printed as distributions (such as histograms) or as complete reports.


    SOURCE IDENTIFICATION      NIOSH/AB1060000
    PRIME NAME                 ACENAPHTHENE, 5-NITRO-
    CAS REGISTRY NUMBER        602-87-9
    CLASS OF COMPOUND          CARCINOGEN-NEOPLASTIGEN
    TOXICOLOGY/CANCER REVIEW   CARCINOGENIC DETERMINATION: ANIMAL POSITIVE
                               IARC** IARC MONOGRAPHS ON THE EVALUATION OF
                               CARCINOGENIC RISK OF CHEMICALS TO MAN.
                               16,319,78
    STATUS                     NCI CARCINOGENESIS BIOASSAY COMPLETED
                               AS OF SEPT 1978
    SYNONYMS                   ACENAPHTHYLENE, 1,2-DIHYDRO-5-NITRO-
    SYNONYMS                   1,2-DIHYDRO-5-NITRO-ACENAPHTHYLENE
    SYNONYMS                   5-NAN
    SYNONYMS                   NCI-C01967
    SYNONYMS                   5-NITROACENAPHTHENE
    SYNONYMS                   5-NITROACENAPTHENE
    SYNONYMS                   5-NITRONAPHTHALENE ETHYLENE
    MOLECULAR FORMULA          C12-H9-N-O2
    MOLECULAR WEIGHT           199.22
    WISWESSER LINE NOTATION    L566 1A LT&&d HNW
    TOXIC DATA SOURCE          TJIDAH Tokyo Jikeikai Ika Daigaku Zasshi.
                               Tokyo Jikeikai Medical Journal. 89,475,74
    TOXDATA KEYWORDS           ORAL;RAT;RODENTS;TDLo;120 gm/kg/17W-C;TOXIC
                               EFFECTS;CARCINOGENIC
    TOXIC DATA SOURCE          BJCAAI British Journal of Cancer. 30,481,74
    TOXDATA KEYWORDS           ORAL;HAMSTER;RODENTS;TDLo;504 gm/kg/24-C;
                               TOXIC EFFECTS;NEOPLASTIC

Fig. 2. Sample RTECS record.


Knowledge Bases


The newest form of automated information retrieval system is based not on bibliographic records or numeric data, but on analyzed and synthesized “knowledge.” In the Hepatitis Knowledge Base, a prototype of this kind of system, knowledge pertaining to viral hepatitis is synthesized from recent reviews by experts in the field (6). Relevant information is selected, placed in a highly organized hierarchical arrangement to permit easy retrieval, and en[...] of the data base is validated by consensus of a group of ten experts in the field of viral hepatitis. Dissenting points of view are included when they reflect current understanding. This knowledge base is updated monthly.

The Electronic Information Exchange System, an experimental computer conference network developed by the New Jersey Institute of Technology and supported by the National Science Foundation (NSF), serves as the principal medium of communication linking experts with one another and with the NLM staff (7).


Limitations of Retrieval Systems


Use of these systems is increasing as they are improved and become better known. However, the systems are not yet widely used. Even when used, they are frequently not exploited to their fullest capacity. Part of the difficulty is that even though many of the systems do not require that the users be formally trained, most searches are performed by trained “search analysts.” Such delegation of the search function is inevitable given the complexity of and differences among existing services. As Williams (8) noted,

    the data bases vary with respect to subject coverage, source types (journals, monographs, patents, theses, book reviews, etc.), file format, record format, data elements included, and indexing or vocabulary practices. . . . On-line systems vary with respect to command languages, protocols, and system responses. [...] systems vary with respect to search features, system features, and output formats.

This variability is a serious hindrance, since it is unlikely that users will become familiar with all the files and systems they might need. Even trained searchers find it difficult to be fully conversant with several data bases and on-line systems. [...] use of these systems [...]

Paradoxically, the superhuman speed and flexibility of computer searching often overwhelms the searcher with instant bibliographies of hundreds or sometimes thousands of citations, creating a formidable reading burden. Critical perusal may be made difficult by the lack of in[...] Moreover, it is likely that the search will not yield 100 percent recall (the proportion of relevant documents retrieved from all the potentially relevant references) or provide 100 percent precision [...] relevant documents in [...] practice, precision and recall are inversely related and are rather elusive measures in operational retrieval systems because of the subjectivity of relevance judgments. The scientist can be further frustrated when he or she cannot locate the articles listed in the printout.
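Recall and precision can be made concrete with a small calculation. The sketch below uses the standard definitions (recall as the share of all relevant documents that were retrieved, precision as the share of retrieved documents that are relevant); the record numbers are hypothetical, and in practice the set of relevant documents is itself a matter of judgment.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    recall    = relevant retrieved / all relevant documents
    precision = relevant retrieved / all retrieved documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: 8 citations retrieved, 10 relevant ones in the file,
# and 6 of the retrieved citations are relevant.
retrieved = range(1, 9)  # records 1 through 8
relevant = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
print(recall_precision(retrieved, relevant))
```

Broadening the search to pick up the four missed records would raise recall but would normally also bring in more irrelevant citations and so lower precision, which is the inverse relation described above.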

Admittedly, none of these problems is unique to computer searching. Nonetheless, until on-line retrieval systems surpass their conventional manual counterparts in overall ease, quality, and cost-effectiveness, many scientists will continue to rely on the more traditional means of gathering information.


Directions in Research and Development 


Much of the recent research and development in the field of information science has been directed toward overcoming the remaining barriers to effective use of large computerized retrieval systems. Significant accomplishments can be noted in three critical areas: (i) data base and terminology selection, (ii) common access and retrieval protocols, and (iii) development of natural language-user interface techniques.


Data Base and Terminology Selection


Because of the proliferation of data bases and the interdisciplinary nature of much scientific research, the selection of appropriate data bases is an important step in the search process. This has led the commercial data base vendors to offer data base selection aids within their individual search systems. Examples are the DIALIST printed indexes to the many files in the Lockheed DIALOG system, the Data Base Index on-line file of the SDC ORBIT system, and the BRS CROSS file index search capability.

Developmental work for generalized intersystem data base selection has been pursued at the Coordinated Science Laboratory at the University of Illinois (9), and a prototype Chemical Data Base Directory (CDBD) system has been implemented at the NLM. The CDBD is a switching file for the emerging chemical substances information network that is being designed to effect the logical integration of diverse data bases and search systems.

As noted earlier, it is difficult to choose all the appropriate search terms for a topic in a specific file or across different files. Thesauri are designed to assist in this task, but they are limited in scope and quickly become dated. Doszkocs [...] feasibility of automatically identifying and displaying terminology associations for very large files such as MEDLINE and TOXLINE. The associations are based on the observed and expected frequency with which terms appear in retrieved sets of records. The Associative Interactive Dictionary (AID) software can also be used to capture automatically and display a variety of other associations, such as the interest profile of a given researcher or a ranked list of chemical substances associated with a specific adverse biological effect.
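The idea of comparing observed with expected term frequencies can be sketched as follows. The text does not give the formula used by AID, so the ratio below is an assumed stand-in chosen for simplicity, and the function name `associated_terms` and all counts are hypothetical.

```python
def associated_terms(retrieved_counts, file_counts, n_retrieved, n_file):
    """Rank terms by observed/expected frequency in a retrieved set.

    The expected count assumes a term occurs in the retrieved set at the
    same rate as in the whole file. This ratio is an illustrative measure
    only, not the published AID algorithm.
    """
    scores = {}
    for term, observed in retrieved_counts.items():
        expected = file_counts[term] / n_file * n_retrieved
        scores[term] = observed / expected
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts: a 100-record retrieved set from a 10,000-record file.
retrieved_counts = {"hepatitis": 40, "human": 90, "liver": 30}
file_counts = {"hepatitis": 200, "human": 8000, "liver": 500}
ranking = associated_terms(retrieved_counts, file_counts, 100, 10_000)
for term, score in ranking:
    print(f"{term}: {score:.3f}")
```

In this made-up example a term such as "human" appears in most retrieved records but ranks last, because it is nearly as common in the file as a whole. Terms that are much more frequent in the retrieved set than chance would predict rise to the top.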

The NSF-sponsored automatic subject-switching work conducted at Battelle Columbus Laboratories (11) is aimed at establishing topical linkages among entries in major thesauri. Linkages have been developed among entries in the Thesaurus of Engineering and Scientific Terms, the Defense Technical Information Center Thesaurus, and others. The ultimate objective is to help searchers overcome the inherent variability and ambiguity of language.


Common Access and Retrieval Protocols


The multiplicity of bibliographic search systems poses a major obstacle to convenient and effective on-line access to the scientific literature. Despite a noticeable trend toward standardization of retrieval systems, important differences continue to prevail. The Conversion for Network Information Transfer (CONIT) research project at the Massachusetts Institute of Technology (12) demonstrated the feasibility of a network access and retrieval interface to four different operational on-line search systems. CONIT incorporates a common retrieval language, hidden access protocols, and extensive instructional dialogue.


Natural Language-User Interface


A variety of man-machine interface techniques have been developed to provide access to well-defined and highly structured computerized data bases, such as the numeric data bases and generalized data base [...] commonly [...] applications. The methods range from structured menu selection (in which the user is offered a series of choices that narrow down the subject) to English-like query languages and sophisticated but restricted language-understanding systems. The special challenge to producers of bibliographic and other textual search systems lies in the fact that the text portion of the documents is written in English or some other versatile but redundant and ambiguous natural language. There is little doubt that the scientist also prefers using the same natural language in expressing search topics of interest. Experience shows that most scientists are unwilling to learn the intricacies and subtleties of access protocols, command languages, Boolean search strategy formulation, and controlled vocabularies.


    MEDLINE CURRENT INFORMATION TRANSFER IN ENGLISH

    PLEASE ENTER YOUR SEARCH QUESTION
    Recombinant DNA guidelines at NIH and other research institutions

    973760 RECORDS SEARCHED
    93 CITATIONS FOUND

    ************************************************************
    *** RECORD NUMBER 1   WEIGHT= 20   MAXIMUM WEIGHT= 25 ***
    ************************************************************
    CITATION #       = '79010565'
    AUTHOR           = 'Dickson D'
    TITLE            = 'NIH confirms violation of recombinant DNA research
                       guidelines [news]'
    JOURNAL TITLE    = 'Nature'
    PAGINATION       = '3'
    PUBLICATION DATE = '4 May 79'
    VOLUME-ISSUE     = '273'

    CONTINUE PRINTING (Y/N)?
    yes

    ************************************************************
    *** RECORD NUMBER 2   WEIGHT= 18   MAXIMUM WEIGHT= 25 ***

Fig. 3. MEDLINE search in which CITE was used.


A citation is unique and unambiguous, but numerical data or analytic statements are dependent on context for their utility. Moreover, if we hope to achieve simplified retrieval systems that are based on the use of natural language, the ambiguities of syntax and grammar will have to be dealt with. Recent work at the NLM (13) resulted in a prototype of an English language interface to MEDLINE, TOXLINE, and the Hepatitis Knowledge Base. No special training is required to use this system, named Current Information Transfer in English (CITE). Questions may be posed in English; the software then searches for documents that contain all or most of the key terms in the query. By using a special algorithm based on combinatorial [...] retrieved records according to their relevancy. The searcher can select documents of interest and can command the system to expand the search by finding other items that contain terminology similar to the selected citations. Figure 3 shows an actual search on MEDLINE in which CITE was used. Although the system does not perform syntactic and semantic analysis (it cannot “think”), its comparative simplicity and performance offer a genuine potential for vastly increased interactive access to the literature of science by the scientists themselves.
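Ranking by matched key terms can be illustrated with a minimal sketch. The text does not specify the CITE algorithm, so the scheme below (drop common words, then count how many of the remaining query terms each record contains) is an assumption for illustration only. The stop-word list and the function name `rank` are hypothetical. The query and the first title come from the sample search in this article; the other two titles are made up.

```python
STOP_WORDS = {"at", "and", "other", "the", "of", "in"}  # assumed list


def rank(query, records):
    """Rank records by how many key terms of the query they contain.

    Returns the ranked (weight, record id) pairs and the maximum weight
    a record could reach, which is the number of key terms in the query.
    """
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    ranked = []
    for rec_id, text in records.items():
        words = set(text.lower().split())
        weight = sum(1 for t in terms if t in words)
        if weight:
            ranked.append((weight, rec_id))
    # Highest weight first; ties are broken by record id for a stable order.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked, len(terms)


records = {
    1: "NIH confirms violation of recombinant DNA research guidelines",
    2: "Recombinant DNA research in university institutions",
    3: "Guidelines for hospital libraries",
}
ranked, max_weight = rank(
    "Recombinant DNA guidelines at NIH and other research institutions", records
)
print(ranked, max_weight)
```

A record need not contain every key term to be retrieved, so the searcher sees near misses as well as exact matches, with the best matches listed first.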


Conclusions 


Although this article attempts to identify the many advantages of full use of existing automated data bases, it also points out some of their shortcomings. There are many problems attendant on information retrieval systems in science and technology, and we have not discussed all of them. For example, what criteria should be used to determine whether a datum or statement is a valid entry to a data base? The NLM tends to rely on the published literature. Other systems rely on patient records or expert opinions. But regardless of which is used, there immediately arises a second set of questions: What published literature? Which patient records? Whose expert opinion? These problems become even more complex when information in data bases is updated. As the number of data bases grows, more sophisticated methods will be required to ensure complete updating of the many data bases. The criteria need not be the same for all data bases, but they must be clearly defined in each case.

There are those who believe that the increasing amount of scientific and technical research will create a volume of information so large as to frustrate the very purpose for which it was created. If this prediction is not going to become a reality, then a larger percentage of the resources now expended on generating scientific and technical information must clearly be invested in research on how to handle the mass of information being generated.
On 23 May Science will publish an issue containing 20 articles devoted to Advanced Technology Materials. The issue will provide a sample of some of the more significant work being conducted in the major industrial research laboratories. The manuscripts have been prepared by leading industrial scientists who have delivered texts that are not only authoritative but also readable and interesting. Upper-division undergraduates, graduate students, and mature scientists will find the issue a valuable sample of applications of fundamental knowledge.

The topics covered include: New Polymers; Conductive Polymers; Multipolymer Systems; Fiber Reinforced Composite Materials; Heterogeneous Catalysts; Glassy Metals; High Strength Low Alloy Steels; Superconductors for High Current, High Fields; New Magnetic Alloys; High Temperature Ceramics; Gas Turbine Materials and Processes; Diamond Technology; New 3-5 Compounds and Alloys; Molecular Beam Epitaxy; New Methods of Processing Semiconductor Wafers; Materials in Relation to Display Technology; Photovoltaic Materials; Magnetic Bubble Materials; Josephson Device Materials; and Biomedical Materials.